Abstract

Learning based hashing methods have attracted consid- 
erable attention due to their ability to greatly increase the 
scale at which existing algorithms may operate. Most of 
these methods are designed to generate binary codes that 
preserve the Euclidean distance in the original space. Man- 
ifold learning techniques, in contrast, are better able to 
model the intrinsic structure embedded in the original high- 
dimensional data. The complexity of these models, and the 
problems with out-of-sample data, have previously rendered 
them unsuitable for application to large-scale embedding, 
however. 

In this work, we consider how to learn compact binary 
embeddings on their intrinsic manifolds. In order to address 
the above-mentioned difficulties, we describe an efficient, 
inductive solution to the out-of-sample data problem, and a 
process by which non-parametric manifold learning may be 
used as the basis of a hashing method. Our proposed ap- 
proach thus allows the development of a range of new hash- 
ing techniques exploiting the flexibility of the wide variety 
of manifold learning approaches available. In particular, we
show that hashing on the basis of t-SNE [29] outperforms
state-of-the-art hashing methods on large-scale benchmark 
datasets, and is very effective for image classification with 
very short code lengths. 

1. Introduction 

One of many challenges emerging from the current ex- 
plosion in the volume of image-based media available is 
how to index and organize the data accurately, but also ef- 
ficiently. Various hashing techniques have attracted con- 
siderable attention in computer vision, information retrieval 
and machine learning [8, 9, 19, 31, 33], and seem to offer
great promise towards this goal. Hashing methods aim to 
encode documents or images as a set of short binary codes, 
while maintaining aspects of the structure of the original 
data. The advantage of these compact binary representations
is that pairwise comparisons may be carried out extremely
efficiently. This means that many algorithms which
are based on such pairwise comparisons can be made more
efficient, and applied to much larger datasets.
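As a rough illustration of why comparing binary codes is cheap, the sketch below packs codes into bytes and computes Hamming distances with XOR and a popcount lookup table. The helper names (`pack_codes`, `hamming_distances`) and the toy codes are assumptions made for this example only; they are not part of the method described here.

```python
import numpy as np

def pack_codes(bits):
    """Pack an (n, r) array of 0/1 bits into (n, ceil(r/8)) uint8 bytes."""
    return np.packbits(np.asarray(bits, dtype=np.uint8), axis=1)

# Lookup table: number of set bits in every possible byte value.
POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def hamming_distances(query, database):
    """Hamming distance from one packed code to every packed database code."""
    xor = np.bitwise_xor(database, query)   # differing bits, byte by byte
    return POPCOUNT[xor].sum(axis=1, dtype=np.int64)

# Toy 10-bit codes: rows 0 and 1 are identical, row 2 is the complement of row 0.
codes = np.array([[0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
                  [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]])
packed = pack_codes(codes)
dists = hamming_distances(packed[0], packed)
```

The padding bits added by `np.packbits` are zero in every code, so they never contribute to the distance.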

Locality sensitive hashing (LSH) [8] is one of the most
well-known data-independent hashing methods, and generates
hash codes based on random projections. With the success
of LSH, random hash functions have been extended to
several similarity measures, including p-norm distances [6],
the Mahalanobis metric [17], and kernel similarity [16, 24].
However, the methods belonging to the LSH family normally
require relatively long hash codes and several hash
tables to achieve both high precision and recall. This leads
to a larger storage cost than would otherwise be necessary,
and thus limits the scale at which the algorithm may be applied.
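The random-projection idea behind LSH can be sketched in a few lines. This is only a minimal illustration of sign-of-projection hashing, assuming Gaussian random hyperplanes through the origin; the dimensions, the code length and the function name `lsh_hash` are arbitrary choices for the example.

```python
import numpy as np

def lsh_hash(X, projections):
    """One bit per random hyperplane: 1 if the projection is positive, else 0."""
    return (X @ projections > 0).astype(np.uint8)

rng = np.random.default_rng(0)
d, r = 16, 32                                # feature dimension and code length (assumed)
projections = rng.standard_normal((d, r))    # data-independent random hyperplanes

x = rng.standard_normal(d)
near = x + 0.01 * rng.standard_normal(d)     # small perturbation of x
far = -x                                     # opposite direction: every sign flips

codes = lsh_hash(np.stack([x, near, far]), projections)
d_near = int(np.sum(codes[0] != codes[1]))
d_far = int(np.sum(codes[0] != codes[2]))
```

Because the hyperplanes are not fitted to the data, many bits are typically needed before Hamming distance tracks the original similarity closely, which is the limitation noted above.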

Data-dependent or learning-based hashing methods have 
been developed with the goal of learning more compact 
hash codes. Directly learning binary embeddings typically 
results in an optimization problem which is very difficult 
to solve, however. Relaxation is often used to simplify 
the optimization {e.g., O] [3TJ . As in LSH, the methods 
aim to identify a set of hyperplanes, but now these hyperplanes
are learned, rather than randomly selected. For example,
PCAH [31], SSH [31], and ITQ [9] generate linear
hash functions through simple PCA projections, while
LDAhash [3] is based on LDA. Extending this idea, there
are also methods which learn hash functions in a kernel
space, such as binary reconstructive embeddings (BRE) [15],
random maximum margin hashing (RMMH) [14] and kernel-based
supervised hashing (KSH) [20]. In a departure from
such methods, however, spectral hashing (SH) [33], one of
the most popular learning-based methods, generates hash
codes by solving a relaxed mathematical program that is
similar to the one in Laplacian eigenmaps [1].

Embedding the original data into a low dimensional
space while simultaneously preserving the inherent neighborhood
structure is critical for learning compact, effective
hash codes. In general, nonlinear manifold learning methods
are more powerful than linear dimensionality reduction
techniques, as they are able to more effectively preserve
the local structure of the input data without assuming
global linearity [26]. The geodesic distance on a manifold
has been shown to outperform the Euclidean distance
in the high-dimensional space for image retrieval [10], for
example. Figure 1 demonstrates that searching using either
the Euclidean or Hamming distance after nonlinear embedding
results in more semantically accurate neighbors than
the same search in the original feature space, and thus that
low-dimensional embedding may actually improve retrieval
or classification performance. However, the only widely
used nonlinear embedding method for hashing is Laplacian
eigenmaps (LE) (e.g., in [21, 33, 35]). Other effective manifold
learning approaches (e.g., LLE [25], elastic embedding
[4] or t-SNE [29]) have rarely been explored for hashing.

Figure 1: Top 10 retrieved digits for 4 queries (a) on a subset of MNIST with 300 samples. Search is conducted in the original
feature space (b, c) and embedding space by t-SNE [29] (d, e) using Euclidean distance (b, d) and Hamming distance (c, e).

One problem hindering the use of manifold learning for
hashing is that these methods do not directly scale to large
datasets. For example, constructing the neighborhood graph
(or pairwise similarity matrix) in these algorithms for $n$
data points takes $O(n^2)$ time, which is intractable for large
datasets. The second problem is that they are typically
non-parametric and thus cannot efficiently solve the critical
out-of-sample extension problem. This fundamentally
limits their application to hashing, as generating codes for
new samples is an essential part of the problem. One of
the widely used solutions for the methods involving spectral
decomposition (e.g., LLE, LE and ISOMap [27]) is the
Nyström extension [2], which solves the problem by learning
eigenfunctions of a kernel matrix. As mentioned in [33],
however, this is impractical for large-scale hashing since the
Nyström extension is as expensive as doing exhaustive nearest
neighbor search ($O(n)$). A more significant problem,
however, is the fact that the Nyström extension cannot be
directly applied to non-spectral manifold learning methods
such as t-SNE.

In order to address the out-of-sample extension prob- 
lem, we propose a new non-parametric regression approach 
which is both efficient and effective. This method allows 
rapid assignment of new codes to previously unseen data in 
a manner which preserves the underlying structure of the 
manifold. Having solved the out-of-sample extension prob- 
lem, we develop a method by which a learned manifold may 
be used as the basis for a binary encoding. This method 
is designed so as to generate encodings which reflect the 
geodesic distances along such manifolds. On this basis we
develop a range of new embedding approaches based on a
variety of manifold learning methods. The best performing
of these is based on manifolds identified through t-SNE,
which has been shown to be effective in discovering semantic
manifolds amongst the set of all images [29].

Given the computational complexity of many manifold
learning methods, we show that it is possible to learn the
manifold on the basis of a small subset of the data $B$ (with
size $m$), and subsequently to inductively insert the remainder
of the data, and any out-of-sample data, into the embedding
in $O(m)$ time per point. This process leads to an
embedding method we label Inductive Manifold-Hashing
(IMH) which we show to outperform state-of-the-art methods
on several large scale datasets both quantitatively and
qualitatively.



Related work

Spectral Hashing Weiss et al. [33] formulated
the spectral hashing (SH) problem as

$$\min_{Y} \sum_{i,j} w(x_i, x_j)\,\|y_i - y_j\|^2 \qquad (1)$$

$$\text{s.t. } Y \in \{-1, 1\}^{n \times r}, \quad Y^\top Y = nI, \quad Y^\top \mathbf{1} = 0.$$

Here $y_i \in \{-1, 1\}^r$, the $i$th row in $Y$, is the hash code
we want to learn for $x_i \in \mathbb{R}^d$, which is one of the $n$
data points in the training data set $X$. $W \in \mathbb{R}^{n \times n}$ with
$W_{ij} = w(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\sigma^2)$ is the graph
affinity matrix, where $\sigma$ is the bandwidth parameter. $I$ is the
identity matrix. The last two constraints force the learned
hash bits to be uncorrelated and balanced, respectively. By
removing the first constraint (i.e., spectral relaxation [33]),
$Y$ can be easily obtained by spectral decomposition on the
Laplacian matrix $L = D - W$, where $D = \mathrm{diag}(W\mathbf{1})$
and $\mathbf{1}$ is the vector with all ones. However, constructing $W$
is $O(dn^2)$ (in time) and calculating the Nyström extension
for a new point is $O(rn)$, which are both intractable for
large datasets. It is assumed in SH [33], therefore, that the
data are sampled from a uniform distribution, which leads
to a simple analytical eigenfunction solution of 1-D Laplacians.
However, this strong assumption is not true in practice
and the manifold structure of the original data is thus
destroyed [21].
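To make the spectral relaxation concrete, the following sketch solves the relaxed problem directly on a tiny dataset: it builds the dense affinity matrix, forms the Laplacian and thresholds its low-order eigenvectors. This is the brute-force $O(dn^2)$ route that the text describes as intractable at scale, not the analytical eigenfunction shortcut used by SH itself; the function name, the toy data and the bandwidth are assumptions for illustration.

```python
import numpy as np

def spectral_relaxation_codes(X, r, sigma):
    """Relaxed solution of problem (1): threshold Laplacian eigenvectors at zero."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)   # O(d n^2) distances
    W = np.exp(-sq / sigma ** 2)           # graph affinity matrix
    L = np.diag(W.sum(axis=1)) - W         # Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    Y = vecs[:, 1:r + 1]                   # skip the trivial constant eigenvector
    return np.where(Y >= 0, 1, -1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),    # cluster A
               rng.normal(2.0, 0.1, (20, 2))])   # cluster B
codes = spectral_relaxation_codes(X, r=1, sigma=1.0)
```

On two well-separated clusters the first non-trivial eigenvector is nearly constant within each cluster, so a single bit already separates them.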

Anchor Graph Hashing To efficiently solve problem (1),
anchor graph hashing (AGH) [21] approximated the affinity
matrix $W$ by the low-rank matrix $\hat{W} = Z\Lambda^{-1}Z^\top$, where
$Z \in \mathbb{R}^{n \times m}$ is the normalized affinity matrix (with $k$ non-zeros
in each row) between the training samples and $m$ anchors
(generated by K-means), and $\Lambda^{-1}$ normalizes $\hat{W}$ to
be doubly stochastic. Then the desired hash functions may
be efficiently identified by binarizing the Nyström eigenfunctions
[2] with the approximated affinity matrix $\hat{W}$.
AGH is thus efficient, in that it has linear training time and
constant search time, but as is the case for SH [33], the
generalized eigenfunction is derived only for the Laplacian
eigenmaps embedding.

Self-Taught Hashing Self-taught hashing (STH) [35] addressed
the out-of-sample problem in a novel way: hash
functions are obtained by training an SVM classifier for
each bit using the pre-learned binary codes as class labels.
The binary codes were learned by directly solving (1) with
a cosine similarity function. This process has prohibitive
computational and memory costs, however, and training the
SVM can be very time consuming for dense data.

2. The proposed method 

2.1. Inductive learning for hashing 

Assume that we have the manifold-based embedding
$Y := \{y_1, y_2, \cdots, y_n\}$ for the entire training data
$X := \{x_1, x_2, \cdots, x_n\}$. Given a new data point $x_q$, we aim to
generate an embedding $y_q$ which preserves the local neighborhood
relationships among its neighbors $N_k(x_q)$ in $X$.
We choose to minimize the following simple objective:

$$C(y_q) = \sum_{i=1}^{n} w(x_q, x_i)\,\|y_q - y_i\|^2. \qquad (2)$$

Here we define

$$w(x_q, x_i) = \begin{cases} \exp(-\|x_q - x_i\|^2/\sigma^2), & \text{if } x_i \in N_k(x_q), \\ 0, & \text{otherwise.} \end{cases}$$

Minimizing (2) naturally uncovers an embedding for the
new point on the basis of its nearest neighbors on the
low-dimensional manifold initially learned on the base set. That
is, in the low-dimensional space, the new embedded location
for the point should be close to those of the points close
to it in the original space.

Differentiating $C(y_q)$ with respect to $y_q$, we have

$$\left.\frac{\partial C(y_q)}{\partial y_q}\right|_{y_q = y_q^*} = 2\sum_{i=1}^{n} w(x_q, x_i)(y_q^* - y_i) = 0, \qquad (3)$$

which leads to the optimal solution

$$y_q^* = \frac{\sum_{i=1}^{n} w(x_q, x_i)\, y_i}{\sum_{i=1}^{n} w(x_q, x_i)}. \qquad (4)$$

Equation (4) provides a simple inductive formulation for the
embedding: produce the embedding for a new data point by
a (sparse) linear combination of the base embeddings.

The proposed approach here is inspired by Delalleau et
al. [7], who focused on non-parametric graph-based
learning in semi-supervised classification. Our aim
here is completely different: we try to scale up the manifold
learning process for hashing in an unsupervised manner.

The resulting solution (4) is consistent with the basic
smoothness assumption in manifold learning, that close-by
data points lie on or close to a locally linear manifold
[25, 27]. This local-linearity assumption has also been
widely used in semi-supervised learning [7, 34], image coding
[32], and similar. In this paper, we propose to apply this
assumption to hash function learning.

However, as mentioned above, (4) does not scale well for
either computing $Y$ ($O(n^2)$, e.g., for LE) or out-of-sample
extension ($O(n)$), which is intractable for large scale tasks.
Next, we show that the following prototype algorithm is
able to approximate $y_q$ well using only a small base set.
This prototype algorithm is based on entropy numbers defined
below.

Definition 1 (Entropy numbers [12]). Given any $Y \subseteq \mathbb{R}^r$
and $m \in \mathbb{N}$, the $m$-th entropy number $\epsilon_m(Y)$ of $Y$ is defined
as

$$\epsilon_m(Y) := \inf\{\epsilon > 0 \mid \mathcal{N}(\epsilon, Y, \|\cdot\|) \le m\},$$

where $\mathcal{N}$ is the covering number. Then $\epsilon_m(Y)$ is the smallest
radius such that $Y$ can be covered by at most $m$ balls.
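The closed-form solution in Equation (4) is a weighted average, which the following sketch spells out. It is a minimal illustration under stated assumptions: the function name `embed_new_point`, the toy data and the 1-D embedding values are made up, and the neighbours are found by brute force over all training points.

```python
import numpy as np

def embed_new_point(x_q, X, Y, k, sigma):
    """Equation (4): weighted mean of the embeddings of the k nearest neighbours."""
    sq = np.sum((X - x_q) ** 2, axis=1)        # squared distances to all training points
    nn = np.argsort(sq)[:k]                    # indices of N_k(x_q)
    w = np.exp(-sq[nn] / sigma ** 2)           # Gaussian weights; zero outside N_k(x_q)
    return (w[:, None] * Y[nn]).sum(axis=0) / w.sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
Y = np.array([[-1.0], [0.0], [1.0], [10.0]])   # toy 1-D embedding (made up)
y_q = embed_new_point(np.array([0.5, 0.0]), X, Y, k=2, sigma=1.0)
```

In this toy case the query is equidistant from its two nearest neighbours, so the result is simply the midpoint of their embeddings.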



2.1.1 The prototype algorithm 

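Inspired by Theorem 27 of [12], we construct a prototype
algorithm below. We use $m$ clusters to cover $Y$. Let
$\alpha_i = w(x_q, x_i)/\sum_{j=1}^{n} w(x_q, x_j)$ and $C_j = \sum_{i \in I_j} \alpha_i$.
For each cluster index set $I_j$, we randomly draw
$\ell_j = \lfloor mC_j + 1 \rfloor$ many indices from $I_j$ proportional to
their weight $\alpha_i$. That is, for $\mu \in \{1, \cdots, \ell_j\}$, the $\mu$-th
randomly drawn index $u_{j,\mu}$ satisfies
$\Pr(u_{j,\mu} = i) = \alpha_i/C_j$, $\forall j \in \{1, \cdots, m\}$.
We then construct $\hat{y}_q$ as

$$\hat{y}_q = \sum_{j=1}^{m} \frac{C_j}{\ell_j} \sum_{\mu=1}^{\ell_j} y_{u_{j,\mu}}. \qquad (5)$$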



Theorem 2. For any even number n' < n. If Prototype 
Algorithm uses n' many non-zero y G Y to express y q , 
then 

Pr[||y 9 -y,||>t]< — ^ — ■ (6) 

Corollary 3. For an even number n', any e > e „/ (Y), any 

6 G (0, 1) and any t > 0, if n' > Mf, then with probability 
at least 1 — 5, 

< 4 > lly,-y 9 ||<*. 



Refer to the supplementary material for the proofs of the
theorem and corollary. The quality of the approximation
depends on $\epsilon_{n'/2}(Y)$ and $n'$. If data exhibit strong clustering
patterns, i.e., data within each cluster are very close to the
cluster center, we will have small $\epsilon_{n'/2}(Y)$, hence better
approximation. Likewise, the bigger $n'$ is, the better the
approximation is.

2.1.2 Approximation of the prototype algorithm 

The clusters can be obtained via a clustering algorithm such
as K-means. Since $n$ could be potentially massive,
it is impractical to compute $\alpha_i$ within all clusters.
Ideally, for each cluster, we
want to select the $y_i$ that has high overall weight
$O_i = \sum_{x_q \in X} \alpha_i(x_q)$. For large scale $X$, we only have limited
information available such as cluster centers
$\{c_j, j = 1, \cdots, m\}$ and $w(c_j, x), x \in X$. Fortunately, the clustering
result gives useful information about $O_i$. The cluster
centers have the largest overall weight w.r.t. the points from
their own cluster, i.e., $\sum_{i \in I_j} w(c_j, x_i)$. This suggests we
should select all cluster centers to express $\hat{y}_q$.



Following many methods in the area (e.g., [21, 33]), we
obtain our general inductive hash function by binarizing the
low-dimensional embedding

$$h(x) = \mathrm{sgn}\left(\frac{\sum_{j=1}^{m} w(x, c_j)\, y_j}{\sum_{j=1}^{m} w(x, c_j)}\right) \qquad (7)$$

where $\mathrm{sgn}(\cdot)$ is the sign function and $Y_B := \{y_1, y_2, \cdots, y_m\}$
is the embedding for the base set $B := \{c_1, c_2, \cdots, c_m\}$,
which is the set of cluster centers obtained by K-means.
Here we assume that the embeddings $y_i$ are centered on the
origin. We term our hashing method Inductive Manifold-Hashing
(IMH). The inductive hash function provides a natural
means for generalization to new data, which has a constant
$O(dm + rk)$ time. With this, the embedding for the
training data becomes

$$Y = \bar{W}_{XB} Y_B \qquad (8)$$

where $\bar{W}_{XB}$ is defined such that
$\bar{W}_{ij} = \frac{w(x_i, c_j)}{\sum_{j=1}^{m} w(x_i, c_j)}$
for $x_i \in X$, $c_j \in B$.

Although the objective function (2) is formally related
to LE, it is general in preserving local similarity. The embeddings
$Y_B$ can be learned by any appropriate manifold
learning method which preserves the similarity of interest
in the low dimensional space. We empirically evaluate several
other embedding methods in Section 2.4. Actually, as
we show, some manifold learning methods (e.g., t-SNE described
in Section 2.2) can be better choices for learning
binary codes, although LE has been widely used. We will
discuss two methods for learning $Y_B$ in the sequel.



Algorithm 1 Inductive Manifold-Hashing (IMH)

Input: Training data $X := \{x_1, x_2, \ldots, x_n\}$, code length $r$, base set
size $m$, neighborhood size $k$

Output: Binary codes $Y := \{y_1, y_2, \ldots, y_n\} \in \{-1, 1\}^{n \times r}$

1) Generate the base set $B$ by random sampling or clustering (e.g., K-means).

2) Embed $B$ into the low dimensional space by (9), (12) or any other
appropriate manifold learning method.

3) Obtain the low dimensional embedding $Y$ for the whole dataset
inductively by Equation (8).

4) Threshold $Y$ at zero.



We summarize the Inductive Manifold-Hashing framework
in Algorithm 1. Note that the computational cost is
dominated by K-means in the first step, which is $O(dmnl)$
in time (with $l$ the number of iterations). Considering that
$m$ (normally a few hundred) is much less than $n$, and is a
function of manifold complexity rather than the volume of
data, the total training time is linear in the size of the training
set. If the embedding method is LE, for example, then using
IMH to compute $Y_B$ requires constructing the small affinity
matrix $W_B$ and solving for $r$ eigenvectors of the $m \times m$
Laplacian matrix $L_B$, which is $O(dm^2 + rm)$. Note that
in step 3, to compute $\bar{W}_{XB}$, one needs to compute the distance
matrix between $B$ and $X$, which is a natural output
of K-means, or can be computed additionally in $O(dmn)$
time. The training process on a dataset of 70K items with
784 dimensions can thus be achieved in a few seconds on a
standard desktop PC.
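The four steps of Algorithm 1 can be sketched end to end as follows. This is a simplified illustration, not the authors' implementation: it assumes a randomly sampled base set (one of the two options in step 1), plain Laplacian eigenmaps on the base set for step 2, and an arbitrary bandwidth; all function names (`sq_dists`, `embed_base_le`, `inductive_weights`, `imh_train`, `imh_hash`) are hypothetical.

```python
import numpy as np

def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    d = (A ** 2).sum(1)[:, None] - 2.0 * A @ B.T + (B ** 2).sum(1)[None, :]
    return np.maximum(d, 0.0)

def embed_base_le(B, r, sigma):
    """Step 2 (assumed choice): Laplacian eigenmaps on the base set only."""
    W = np.exp(-sq_dists(B, B) / sigma ** 2)
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)                    # ascending eigenvalues
    return vecs[:, 1:r + 1] * np.sqrt(len(B))      # drop the trivial eigenvector

def inductive_weights(X, B, k, sigma):
    """Row-normalised weights of Equation (8), with k non-zeros per row."""
    D = sq_dists(X, B)
    W = np.zeros_like(D)
    rows = np.arange(len(X))[:, None]
    nn = np.argsort(D, axis=1)[:, :k]              # k nearest base points
    W[rows, nn] = np.exp(-D[rows, nn] / sigma ** 2)
    return W / W.sum(axis=1, keepdims=True)

def imh_train(X, r, m, k, sigma, rng):
    B = X[rng.choice(len(X), size=m, replace=False)]   # step 1: random sampling
    Y_B = embed_base_le(B, r, sigma)                   # step 2
    Y = inductive_weights(X, B, k, sigma) @ Y_B        # step 3, Equation (8)
    return B, Y_B, np.where(Y >= 0, 1, -1)             # step 4: threshold at zero

def imh_hash(x, B, Y_B, k, sigma):
    """Equation (7) applied to unseen points x of shape (n_new, d)."""
    return np.where(inductive_weights(x, B, k, sigma) @ Y_B >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
B, Y_B, codes = imh_train(X, r=4, m=40, k=5, sigma=2.0, rng=rng)
```

Only the $m \times m$ base-set problem involves an eigendecomposition; every other point, seen or unseen, is encoded through the same sparse weighted average.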

Connection to the Nyström method As in Equation (4),
the Nyström eigenfunction by Bengio et al. [2] also generalizes
to a new point by a linear combination of a set of low
dimensional embeddings:



$$\phi(x) = \sqrt{n}\, \Sigma_r^{-1} \sum_{j=1}^{n} \tilde{K}(x, x_j)\, V_r^\top e_j.$$



For LE, $V_r$ and $\Sigma_r$ correspond to the top $r$ eigenvectors and
eigenvalues of a normalized kernel matrix $\tilde{K}$ with

$$\tilde{K}_{ij} = \tilde{K}(x_i, x_j) = \frac{1}{n} \frac{w(x_i, x_j)}{\sqrt{E_x[w(x_i, x)]\, E_x[w(x, x_j)]}}.$$

In AGH [21], the formulated hash function was proved to be the
corresponding Nyström eigenfunction with the approximate low-rank
affinity matrix. LELVM [5] also formulates out-of-sample
mappings for LE in a manner similar to (4) by combining
latent variable models. Both of these methods, and ours, can
thus be seen as applications of the Nyström method. Note,
however, that our method differs in that it is not restricted
to spectral methods such as LE, and that we aim to learn binary
hash functions for similarity-based search rather than
dimensionality reduction. LELVM [5] cannot be applied to
embedding methods other than LE.



2.2. Stochastic neighborhood preserving hashing 

In order to demonstrate our approach we now derive
a hashing method based on t-SNE [29], which is a non-spectral
embedding method. t-SNE is a modification of
stochastic neighborhood embedding (SNE) [13] which aims
to overcome the tendency of that method to crowd points
together in one location. t-SNE provides an effective technique
for visualizing data and dimensionality reduction,
which is capable of preserving local structures in the high
dimensional data while retaining some global structures
[29]. These properties make t-SNE a good choice for nearest
neighbor search. Moreover, as stated in [30], the cost
function of t-SNE in fact maximizes the smoothed recall
[30] of query points and their neighbors.
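The t-SNE cost function referred to above is a Kullback-Leibler divergence between pairwise similarities in the input space and in the embedding. The sketch below evaluates that cost for a given embedding. It is a simplification under stated assumptions: a single fixed Gaussian bandwidth replaces the per-point perplexity search of standard t-SNE, no optimization is performed, and the function names and toy data are hypothetical.

```python
import numpy as np

def student_t_joint(Y):
    """Low-dimensional joint probabilities q_ij from a Student-t kernel."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)
    num = 1.0 / (1.0 + sq)
    np.fill_diagonal(num, 0.0)             # q_ii is defined as zero
    return num / num.sum()

def gaussian_joint(X, sigma):
    """Symmetrised high-dimensional probabilities p_ij (fixed bandwidth, assumed)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    P = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)      # conditional p_{j|i}
    return (P + P.T) / (2.0 * len(X))      # symmetrise so that all entries sum to one

def tsne_cost(P, Q, eps=1e-12):
    """KL divergence sum_ij p_ij log(p_ij / q_ij), skipping zero entries of P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X_B = rng.standard_normal((30, 10))        # toy base set
Y_B = rng.standard_normal((30, 2))         # a random, unoptimised embedding
P, Q = gaussian_joint(X_B, sigma=3.0), student_t_joint(Y_B)
cost = tsne_cost(P, Q)
```

Gradient descent on this cost with respect to the embedding is what moves a random layout such as `Y_B` towards one that preserves neighbourhoods.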

The original t-SNE does not scale well, as it has a time 
complexity which is quadratic in n. More significantly, 
however, it has a non-parametric form, which means that 
there is no simple function which may be applied to out- 
of-sample data in order to calculate their coordinates in the 
embedded space. As was proposed in the previous subsection,
we first apply t-SNE to the base set $B$ [29],

$$\min_{Y_B} \sum_{x_i \in B} \sum_{x_j \in B} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right). \qquad (9)$$

Here $p_{ij}$ is the symmetrized conditional probability in the
high dimensional space, and $q_{ij}$ is the joint probability defined
using the t-distribution in the low dimensional embedding
space. The optimization problem (9) is easily solved
by a gradient descent procedure. After we get embeddings
$Y_B$ of samples $x_i \in B$, the hash codes for the entire dataset
can be easily computed using (7). It is this method which
we label IMH-tSNE.

2.3. Hashing with relaxed similarity preservation

As in the last subsection, we can compute $Y_B$ considering
local smoothness only within $B$. Based on Equation (8),
in this subsection we alternatively compute $Y_B$ by considering
the smoothness both within $B$ and between $B$ and $X$.
As in [7], the objective can be easily obtained by modifying
(1) as:

$$C(Y_B) = \sum_{x_i, x_j \in B} w(x_i, x_j)\,\|y_i - y_j\|^2 \quad (C_{BB})$$

$$\qquad\quad + \lambda \sum_{x_i \in B,\, x_j \in X} w(x_i, x_j)\,\|y_i - y_j\|^2 \quad (C_{BX}) \qquad (10)$$

where $\lambda$ is the trade-off parameter. $C_{BB}$ enforces smoothness
of the learned embeddings within $B$ while $C_{BX}$ ensures
the smoothness between $B$ and $X$. This formulation
is actually a relaxation of (1), by discarding the part
which minimizes the dissimilarity within $X$ (denoted as
$C_{XX}$). $C_{XX}$ is ignored since computing the similarity matrix
within $X$ costs $O(n^2)$ time. The smoothness between
points in $X$ is implicitly ensured by (8).

Applying Equation (8) for $y_j$, $j \in X$ to (10), we obtain
the following problem

$$\min \; \mathrm{trace}(Y_B^\top (D_B - W_B) Y_B) + \lambda\, \mathrm{trace}(Y_B^\top (D_{BX} - \bar{W}_{XB}^\top \bar{W}_{XB}) Y_B) \qquad (11)$$

where $D_B = \mathrm{diag}(W_B \mathbf{1})$ and $D_{BX} = \mathrm{diag}(W_{BX} \mathbf{1})$ are
both $m \times m$ diagonal matrices. Taking the constraint in (1),
we obtain

$$\min \; \mathrm{trace}(Y_B^\top (M + \lambda T) Y_B) \quad \text{s.t. } Y_B^\top Y_B = mI \qquad (12)$$

where $M = D_B - W_B$, $T = D_{BX} - \bar{W}_{XB}^\top \bar{W}_{XB}$. The optimal
solution $Y_B$ of the above problem is easily obtained
by identifying the $r$ eigenvectors of $M + \lambda T$ corresponding
to the smallest eigenvalues (excluding the eigenvalue with
respect to the trivial eigenvector $\mathbf{1}$). We name this method
IMH-LE in the following text.

2.4. Manifold learning methods for hashing



In this section, we compare different manifold learning
methods for hashing within our IMH framework. The comparison
results are reported in Figure 2. For comparison, we
also evaluate the linear PCA within the framework (IMH-PCA
in the figure). We can clearly see that IMH-tSNE,
IMH-SNE and IMH-EE (with Elastic Embedding (EE) [4])
perform slightly better than IMH-LE (Section 2.3). This is
mainly because these three methods are able to preserve local
neighborhood structure while, to some extent, preventing
data points from crowding together. It is promising that
all of these methods perform better than an exhaustive $\ell_2$
scan using the uncompressed GIST features.

Figure 2 shows that LE (IMH-LE$_B$ in the figure), the
most widely used embedding method in hashing, does not 
perform as well as a variety of other methods (including 
t-SNE), and in fact performs worse than PCA, which is a 
linear technique. This is not surprising because LE (and 
similarly LLE) tends to collapse large portions of the data 
(and not only nearby samples in the original space) close 
together in the low-dimensional space. The results are con- 
sistent with the analysis in pl|29)- Based on the above ob- 
servations, we argue that manifold learning methods (e.g. t- 
SNE, EE), which not only preserve local similarity but also 
force dissimilar data apart in the low-dimensional space, are 
more effective than the popular LE for hashing. 

It is interesting to see that IMH-PCA outperforms PCAH
[31] by a large margin, despite the fact that PCAH is performed
on the whole training data set. This shows that the
generalization capability of IMH based on a very small set
of data points also works for linear dimensionality reduction
methods.

[Plots for Figure 2: CIFAR (x-axis: code length, 16 to 128) and CIFAR @ 64 bits; curves: IMH-tSNE, IMH-SNE, IMH-EE, IMH-LE, IMH-LE$_B$, IMH-LLE, IMH-DM, IMH-PCA, PCAH, GIST L2 scan]

Figure 2: Comparison among different manifold learning
methods within our IMH hashing framework on CIFAR-10.
IMH with the linear PCA (IMH-PCA) and PCAH [31] are
also evaluated for comparison. For clarity, for IMH-LE in
Section 2.3 we term IMH with the original LE algorithm
on the base set $B$ as IMH-LE$_B$. IMH-DM is IMH with the
diffusion maps of [18].

3. Experimental results 

We evaluate IMH on four large scale image datasets:
CIFAR-10, MNIST, SIFT1M [31] and GIST1M. The
MNIST dataset consists of 70,000 images, each of 784 dimensions,
of handwritten digits from '0' to '9'. As a subset
of the well-known 80M tiny image collection [28], CIFAR-10
consists of 60,000 images which are manually labelled
as 10 classes with 6,000 samples for each class. We represent
each image in this dataset by a GIST feature vector
[23] of dimension 512. For MNIST and CIFAR-10, the
whole dataset is split into a test set with 1,000 samples and
a training set with all remaining samples.

We compare nine hashing algorithms including the proposed
IMH-tSNE, IMH-LE and seven other unsupervised
state-of-the-art methods: PCAH [31], SH [33], AGH [21],
STH [35], BRE [15], ITQ [9], and Spherical Hashing (SpH)
[11]. We use the provided codes and suggested parameters
according to the authors of these methods. Because our
methods are fully unsupervised we did not consider supervised
methods in our experiments. Due to the high computational
cost of BRE and high memory cost of STH, we
sample 1,000 and 5,000 training points for these two methods
respectively. We measure performance by mean
average precision (MAP) or precision and recall curves for
Hamming ranking using 16 to 128 hash bits. We also report
the results for hash lookup using a Hamming radius within 2
by $F_1$ score [22]: $F_1 = 2(\text{precision} \cdot \text{recall})/(\text{precision} + \text{recall})$.
Ground truths are defined by the category information
for the labeled datasets MNIST and CIFAR-10, and by
Euclidean neighbors for SIFT1M and GIST1M.

Figure 3: MAP results versus varying base set size $m$ (left,
fixing $k = 5$) and number of nearest base points $k$ (right,
fixing $m = 400$) for the proposed methods and AGH. The
comparison is conducted on the CIFAR-10 dataset using 64 bits.

Table 1: MAP (%) evaluation of different base generating
methods: random sampling vs. K-means. The comparison
is performed on the CIFAR-10 dataset with code lengths
from 32 to 96 and base set size 400.
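The two evaluation measures described above can be sketched as follows for a single query; MAP is the mean of the average precision over all queries. This is an illustrative implementation on unpacked codes, with hypothetical function names and a made-up four-item database, and ties in Hamming distance are broken by database order.

```python
import numpy as np

def f1_score(precision, recall):
    """F1 = 2 * precision * recall / (precision + recall); 0 if both are 0."""
    total = precision + recall
    return 0.0 if total == 0 else 2.0 * precision * recall / total

def lookup_f1(query, database, relevant, radius=2):
    """F1 of hash lookup: retrieve every code within the Hamming radius."""
    dist = np.sum(database != query, axis=1)
    retrieved = dist <= radius
    hits = np.sum(retrieved & relevant)
    precision = hits / retrieved.sum() if retrieved.any() else 0.0
    recall = hits / relevant.sum() if relevant.any() else 0.0
    return f1_score(precision, recall)

def average_precision(query, database, relevant):
    """AP of Hamming ranking for one query."""
    order = np.argsort(np.sum(database != query, axis=1), kind="stable")
    rel = relevant[order]
    if not rel.any():
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float(np.sum(precision_at_k * rel) / rel.sum())

database = np.array([[1, 1, 1, 1],
                     [1, 1, 1, -1],
                     [-1, -1, -1, -1],
                     [1, -1, -1, -1]])
query = np.array([1, 1, 1, 1])
relevant = np.array([True, False, True, False])   # made-up ground truth
f1 = lookup_f1(query, database, relevant, radius=2)
ap = average_precision(query, database, relevant)
```

Here the lookup retrieves two items of which one is relevant, and the ranking places the two relevant items first and last.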

Base selection In this section, we take the CIFAR-10
dataset as an example to compare different base generation
methods and different base sizes for the proposed methods.
AGH is also evaluated here for comparison. Table 1 compares
two methods for generating base point sets: random
sampling and K-means on the training data. Not surprisingly,
we see that the performance of our methods using
K-means is better at all code lengths than that using random
sampling. We can also see that, even with a base set generated
by random sampling, the proposed methods outperform AGH in
all cases but one. Due to the superior results and high efficiency
in practice, we generate the base set by K-means in
the following experiments.

From Figure 3 we see that the performance of the proposed
methods and AGH does not change significantly with
either the base set size $m$ or the number of nearest base
points $k$. Based on this observation, for the remainder of
this paper, we set $m = 400$ and $k = 5$ for our methods,
unless otherwise specified. It is also clear that IMH-LE$_B$,
which only enforces smoothness in the base set, does not
perform as well as IMH-LE, which also enforces smoothness
between the base set and training set. Note, however,
that IMH-LE$_B$ is still better than AGH on this dataset.

Results on CIFAR-10 dataset We report the comparative
results based on MAP for Hamming ranking with code
lengths from 16 to 128 bits in Figure 4. We see that the proposed
IMH-LE and IMH-tSNE perform best in all cases.
Among the proposed algorithms, the LE based IMH-LE
is inferior to the t-SNE based IMH-tSNE. IMH-LE is still
much better than AGH and STH, however. ITQ performs
better than SpH and BRE on this dataset, but is still inferior
to IMH. SH and PCAH perform worst in this case, because
SH relies upon its uniform data assumption while PCAH
simply generates the hash hyperplanes by PCA directions,
which does not explicitly capture the similarity information.
The results are consistent with the complete precision and
recall curves shown in the supplementary material. We also
report the $F_1$ results for hash lookup with Hamming radius
2. It can be seen that IMH-LE and IMH-tSNE also outperform
all other methods by large margins. BRE and AGH
obtain better results than the remaining methods, although
the performance of all methods drops as code length grows.

Figure 5 shows the precision and recall curves of Hamming
ranking for the compared methods. We see that STH
and AGH obtain relatively high precision when a small
number of samples is returned; however, precision drops
significantly as the number of retrieved samples increases.
In contrast, IMH-tSNE, IMH-LE and ITQ achieve higher
precision with relatively larger numbers of retrieved points.

We also show qualitative results of IMH and related
methods on a sample query in Figure 6. As can be seen,
IMH-tSNE achieves the best search quality in terms of visual
relevance.

Results on MNIST dataset The MAP and F 1 scores for 
these compared methods are reported in Figure|7] As in Fig- 
ure)?] IMH-tSNE achieves the best results. On this dataset 
we can clearly see that IMH-tSNE outperforms IMH-LE by 
a large margin, which increases as code length increases. 
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Figure 4: Comparison of different methods on CIFAR-10 
based on MAP (left) and F\ (right) for varying code lengths. 

This further demonstrates the advantage of t-SNE as a tool 
for hashing by embedding high dimensional data into a low 
dimensional space. The dimensionality reduction procedure 
not only preserves the local neighborhood structure, but also 
reveals important global structure (such as clusters) [29]. 
Among the four LE-based methods, while IMH-LE shows 
a small advantage over AGH, both methods achieve much 



Method 


Train time 


Test time 






64-bits 


128-bits 


64-bits 


128-bits 




IMH-LE 


9.9 


9.9 


5.1 x 10~ ;> 


3.8 x 10" 


5 


IMH-tSNE 


16.7 


20.2 


2.8 x 10~ 5 


3.1 x 10" 


5 


SH 


6.8 


16.2 


5.8 x ur 5 


1.8 x 10" 


4 


STH 


266.1 


485.4 


1.8 x 10~ 3 


3.6 x i0" 


3 


AGH 


9.5 


9.5 
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5 


PCAH 


3.8 
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SpH 
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41.0 
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5 


ITQ 


10.4 


20.3 


6.9 x 10" 6 
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5 


BRE 


418.9 


1731.9 


1.2 x 10~ 5 
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5 



Table 2: Comparison of training and testing times (in
seconds) on MNIST with 70K 784D feature points. K-means
dominates the cost of AGH and IMH (8.9 seconds), and can
be conducted in advance in practice. The experiments
were run on a desktop PC with a 4-core 3.07 GHz CPU and
8 GB RAM.

Figure 6: The query image (a) and the query results returned by various methods with 32 hash bits. False positive returns are
marked with red borders.

Figure 5: Comparison of different methods on CIFAR-10
based on precision (left) and recall (right) using 64 bits.
Please refer to the supplementary material for complete
results for other code lengths.



better results than STH and SH. ITQ and BRE obtain high
MAPs with longer bit lengths, but they still perform less
well on the hash lookup F1. PCAH performs worst in
terms of both MAP and the F1 measure. Refer to the
supplementary material for the complete precision and recall
curves, which validate the observations here.
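
To make the two evaluation measures concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes that codes are stored as non-negative Python integers, that `relevant` is the set of ground-truth indices for a query, and that hash lookup retrieves every database item within a fixed Hamming radius (2 by default, the radius reported for SIFT1M and GIST1M). All function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two codes stored as non-negative ints."""
    return bin(a ^ b).count("1")


def lookup_f1(query, database, relevant, radius=2):
    """F1 of hash lookup: retrieve every item within `radius` of the query."""
    retrieved = {i for i, code in enumerate(database)
                 if hamming(query, code) <= radius}
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0  # also covers a failed lookup (nothing retrieved)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def average_precision(query, database, relevant):
    """Average precision of a Hamming ranking (ties broken by index)."""
    ranking = sorted(range(len(database)),
                     key=lambda i: (hamming(query, database[i]), i))
    hits, total = 0, 0.0
    for rank, i in enumerate(ranking, start=1):
        if i in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(queries, database, relevant_sets):
    """MAP: mean of the per-query average precision."""
    aps = [average_precision(q, database, rel)
           for q, rel in zip(queries, relevant_sets)]
    return sum(aps) / len(aps)
```

Hash lookup scores a query as zero whenever no relevant item falls inside the radius, whereas Hamming ranking always returns an ordering. This is one reason a method can have a high MAP and still a low lookup F1.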

Efficiency. Table 2 shows training and testing times on the
MNIST dataset for various methods, and shows that the
linear method, PCAH, is fastest. IMH-tSNE is slower than
IMH-LE, AGH and SH in terms of training time; however,
all of these methods have relatively low execution times and
are much faster than STH and BRE. In terms of test time,
both IMH algorithms are comparable to the other methods,
except STH, which takes much more time to predict the binary
codes by SVM on this non-sparse dataset.

Figure 7: Comparison of different methods on the MNIST
dataset using MAP (left) and F1 (right) for varying code
lengths.

Figure 8: Comparative results on SIFT1M for F1
(left) and recall (right) with Hamming radius 2. Ground
truth is defined to be the closest 2 percent of points as
measured by the Euclidean distance.

Results on SIFT1M and GIST1M. SIFT1M contains one
million local SIFT descriptors extracted from a large set of
images [31], each of which is represented by a 128D
vector of histograms of gradient orientations. GIST1M
contains one million GIST features and each feature is
represented by a 960D vector. For both of these datasets, one
million samples are used as the training set and an additional
10K are used for testing. As in [31], ground truth is defined as
the closest 2 percent of points as measured by the Euclidean
distance. For these two large datasets, we generate 1,000
points by K-means and set k = 2 for both IMH and AGH.
The comparative results on SIFT1M and GIST1M are
summarized in Figure 8 and Figure 9, respectively. Again, IMH
consistently achieves superior results in terms of both F1
score and recall with Hamming radius 2. We see that the
performance of most of these methods decreases dramatically
with increasing code length as the Hamming spaces
become more sparse, which makes the hash lookup fail
more often. However, IMH-tSNE still achieves relatively
high scores with large code lengths. If we look at Figure 8
(left), ITQ obtains the highest F1 with 16 bits; however, it
decreases to near zero at 64 bits. In contrast, IMH-tSNE
still manages an F1 of 0.2. Similar results are observed in
the recall curves.
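
The sparsity argument can be checked with simple arithmetic. The sketch below is an illustration rather than anything taken from the experiments: it counts the codes within Hamming radius 2 of a query, which is 1 + r + r(r-1)/2 for r-bit codes, and compares this count with the 2^r codes available. The function names are hypothetical.

```python
from math import comb


def ball_size(bits, radius=2):
    """Number of distinct codes within `radius` bit flips of a given code."""
    return sum(comb(bits, d) for d in range(radius + 1))


def ball_fraction(bits, radius=2):
    """Fraction of the 2**bits Hamming space covered by that ball."""
    return ball_size(bits, radius) / 2 ** bits


for bits in (16, 32, 64, 128):
    print(bits, ball_size(bits), f"{ball_fraction(bits):.3e}")
```

For 16-bit codes the radius-2 ball holds 137 of the 65,536 possible codes, while for 64-bit codes it holds only 2,081 of roughly 1.8 x 10^19. The ball grows quadratically with code length while the space grows exponentially, so a lookup at a fixed radius is increasingly likely to return nothing.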



Classification on binary codes. In order to demonstrate
classification performance, we have trained a linear SVM on
the binary codes generated by IMH for the MNIST dataset.
In order to learn codes with higher bit lengths for IMH and
AGH, we set the size of the base set to 1,000. Accuracies
of different binary encodings are shown in Figure 10. Both
IMH and AGH achieve high accuracies on this dataset,
although IMH performs better with higher code lengths. In
contrast, the best results of all other methods, obtained by
ITQ, are consistently worse than those for IMH, especially
for short code lengths. Note that even with only 128-bit
binary features IMH obtains a high accuracy of 94.1%.
Interestingly, we get the same classification rate of 94.1%
when applying the linear SVM to the uncompressed 784D
features, which occupy several hundred times as much space
as the learned hash codes.
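
The storage comparison can be verified with a short calculation. This sketch assumes the 784D features are stored as 32-bit or 64-bit floating-point values; the precision is not stated in the text, so both cases are shown. `compression_ratio` is a hypothetical helper.

```python
def compression_ratio(dims, bits_per_value, code_bits):
    """How many times larger the raw feature vector is than the hash code."""
    return dims * bits_per_value / code_bits


# 784D MNIST features versus a 128-bit code.
print(compression_ratio(784, 32, 128))  # single precision
print(compression_ratio(784, 64, 128))  # double precision
```

Under these assumptions the raw features are 196 times (single precision) or 392 times (double precision) larger than a 128-bit code.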

Figure 9: Comparative results on GIST1M for F1
(left) and recall (right) with Hamming radius 2. Ground
truth is defined to be the closest 2 percent of points as
measured by the Euclidean distance.

Figure 10: Classification accuracy (%) on MNIST with
binary codes of various hashing methods by linear SVM.

4. Conclusion

We have proposed a simple yet effective hashing framework
which provides a practical connection between manifold
learning methods (typically non-parametric and with
high computational cost) and hash function learning
(requiring high efficiency). By preserving the underlying
manifold structure with several non-parametric
dimensionality reduction methods, the proposed hashing methods
outperform several state-of-the-art methods in terms of both
hash lookup and Hamming ranking on several large-scale
retrieval datasets. With the proposed inductive formulation
of the hash function, the proposed methods require only
linear time (O(n)) to index all of the training data and
constant search time for a novel query. The learned hash
codes were also shown to give promising results on a
classification problem even with very short code lengths.
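
To illustrate why a novel query costs constant time, here is a minimal sketch of an inductive encoder of the kind summarized above. It assumes that a query's embedding is a normalized, Gaussian-weighted average of the embeddings of its k nearest base points (for example, K-means centres), and that each bit is obtained by thresholding the embedding at zero. The weighting kernel, the bandwidth `sigma`, and all names are assumptions made for illustration and should not be read as the authors' implementation.

```python
import math


def encode(query, base_points, base_embeddings, k=2, sigma=1.0):
    """Binary code for `query` from its k nearest base points.

    The cost depends on the base set size and the code length,
    not on the number of indexed items.
    """
    # Squared Euclidean distance from the query to every base point.
    dists = [sum((q - c) ** 2 for q, c in zip(query, point))
             for point in base_points]
    nearest = sorted(range(len(base_points)), key=dists.__getitem__)[:k]

    # Gaussian weights, shifted by the smallest distance for numerical
    # stability; the shift cancels once the weights are normalized.
    d_min = dists[nearest[0]]
    weights = [math.exp(-(dists[i] - d_min) / sigma ** 2) for i in nearest]
    total = sum(weights)

    # Weighted average of the neighbours' low-dimensional embeddings.
    dim = len(base_embeddings[0])
    embedding = [sum(w * base_embeddings[i][j]
                     for w, i in zip(weights, nearest)) / total
                 for j in range(dim)]

    # One bit per embedding dimension: 1 if positive, else 0.
    return [1 if value > 0 else 0 for value in embedding]
```

Applying `encode` to each of the n training points gives the linear indexing cost, since the work per point depends only on the (fixed) base set size and the code length.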

This work was supported in part by ARC Future Fellowship
FT120100969.